add dmrlet - lightweight node agent for Docker Model Runner #627
Conversation
Hey - I've found 4 issues and left some high-level feedback:

- In `pkg/dmrlet/container/manager.go`, `NewManager` takes a `containerd` address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
- In `daemon.scaleUp` you ignore errors from `container.NewSpecBuilder` and `modelStore.GetModelPath` (using `_` for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas (see the sketch after this list).
- The daemon client's error handling in `pkg/dmrlet/daemon/api.go` (`Client.Serve`) assumes the error body is JSON-decodable into a string, but the server uses `http.Error` (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.
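A hedged sketch of the error-propagation change the second bullet asks for; the identifiers (`container.NewSpecBuilder`, `d.modelStore.GetModelPath`, `config.Backend`, `config.Model`) are taken from the review comment and have not been verified against the PR's actual signatures:

```go
// Not the PR's exact code: propagate the two errors that daemon.scaleUp
// currently discards with `_`, so scaling up fails fast on a bad backend
// or an unresolvable model path.
builder, err := container.NewSpecBuilder(config.Backend)
if err != nil {
	return fmt.Errorf("scale up %s: unsupported backend: %w", config.Model, err)
}

modelPath, err := d.modelStore.GetModelPath(config.Model)
if err != nil {
	return fmt.Errorf("scale up %s: resolve model path: %w", config.Model, err)
}

// builder and modelPath then feed into the replica's container spec as before.
```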
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `pkg/dmrlet/container/manager.go`, `NewManager` takes a `containerd` address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
- In `daemon.scaleUp` you ignore errors from `container.NewSpecBuilder` and `modelStore.GetModelPath` (using `_` for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas.
- The daemon client’s error handling in `pkg/dmrlet/daemon/api.go` (`Client.Serve`) assumes the error body is JSON-decodable into a string, but the server uses `http.Error` (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.
## Individual Comments
### Comment 1
<location> `pkg/dmrlet/daemon/daemon.go:590-593` </location>
<code_context>
+ return d.logAggregator.StreamLogs(context.Background(), deployment.Containers[0], lines, follow)
+}
+
+func (d *Daemon) allocatePort() int {
+ port := d.nextPort
+ d.nextPort++
+ return port
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Port allocation is not concurrency-safe and can race between Serve/scaleUp calls
allocatePort updates d.nextPort without synchronization. Serve calls it under d.mu, but scaleUp calls it without holding the lock, so concurrent calls can race and assign duplicate ports. Please protect nextPort with d.mu (or an atomic), or move port assignment under a shared lock so all callers synchronize consistently.
</issue_to_address>
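A minimal, self-contained sketch of the mutex-guarded allocator this comment asks for; the `portAllocator` type and its field names are illustrative, not the daemon's actual struct:

```go
package main

import (
	"fmt"
	"sync"
)

// portAllocator hands out sequential ports under a mutex so concurrent
// callers (e.g. Serve and scaleUp) can never receive the same port.
type portAllocator struct {
	mu       sync.Mutex
	nextPort int
}

func (p *portAllocator) allocate() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	port := p.nextPort
	p.nextPort++
	return port
}

func main() {
	alloc := &portAllocator{nextPort: 30000}
	fmt.Println(alloc.allocate(), alloc.allocate()) // 30000 30001
}
```

An atomic counter (`sync/atomic`) would also work; the mutex variant is shown because the daemon already holds `d.mu` on some call paths.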
### Comment 2
<location> `pkg/dmrlet/daemon/api.go:500-503` </location>
<code_context>
+ }
+ defer resp.Body.Close()
+
+ if resp.StatusCode != http.StatusOK {
+ return c.fetchStatsEndpoint(ctx, endpoint)
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Client.Serve error handling assumes JSON string body but server uses http.Error with plain text
Because the server uses http.Error for non-200 responses, the body is plain text. Decoding it as JSON into a string will usually fail and drop the real error message. Instead, read resp.Body as raw bytes and surface that content in the error (falling back to resp.Status if the body is empty), and apply the same pattern to the other client methods that ignore the error body.
</issue_to_address>
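A hedged sketch of one way to read the plain-text error body and fall back to `resp.Status` when it is empty; the `decodeError` helper name is hypothetical and not part of the PR:

```go
package daemonclient

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// decodeError turns a non-200 HTTP response into an error. It reads the
// plain-text body that http.Error writes and falls back to the status
// line when the body is empty, so the server's real message survives.
func decodeError(resp *http.Response) error {
	body, _ := io.ReadAll(resp.Body)
	msg := strings.TrimSpace(string(body))
	if msg == "" {
		msg = resp.Status
	}
	return fmt.Errorf("daemon error: %s", msg)
}
```

The same helper could then be reused by the other client methods that currently ignore the error body.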
### Comment 3
<location> `README.md:427` </location>
<code_context>
+| Feature | Kubernetes | dmrlet |
+|---------|------------|--------|
+| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
+| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
+| Time to first inference | Minutes (pod scheduling, image pull) | Seconds (model already local) |
+| Model management | External (mount PVCs, manage yourself) | Integrated with Docker Model Runner store |
</code_context>
<issue_to_address>
**suggestion (typo):** Consider adding "of" for smoother grammar in this table entry.
Change “50+ lines YAML minimum” to “50+ lines of YAML minimum” or “at least 50 lines of YAML” for clearer grammar.
```suggestion
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |
```
</issue_to_address>
### Comment 4
<location> `README.md:502` </location>
<code_context>
+# DAEMON: running
+# SOCKET: /var/run/dmrlet.sock
+#
+# GPUS:
+# GPU 0: NVIDIA A100 80GB 81920MB (in use: llama3.2)
+# GPU 1: NVIDIA A100 80GB 81920MB (available)
</code_context>
<issue_to_address>
**issue (typo):** Typo: "GPUS" should be "GPUs".
In the status example, change the header label from "GPUS" to "GPUs" to use the correct plural form.
```suggestion
# GPUs:
```
</issue_to_address>
Code Review
This pull request introduces dmrlet, a new container orchestrator for AI inference. The changes are extensive, adding a new CLI tool and several backend packages for managing containers, GPUs, services, and more. The overall architecture is well-designed, with clear separation of concerns between components like the daemon, container manager, GPU allocator, and service registry.
My review focuses on improving the robustness and correctness of the implementation. I've identified a few high-priority issues, including the use of the `docker` CLI instead of the Go SDK, which can be brittle, and some bugs in the API client related to log streaming and error handling. I've also included some medium-severity suggestions to address potential race conditions, incomplete features, and hardcoded values.
Overall, this is a great addition with a solid foundation. Addressing these points will make dmrlet more reliable and maintainable.
pkg/dmrlet/container/manager.go
Outdated
    // DockerRuntime implements Runtime using Docker CLI.
    type DockerRuntime struct{}
The `DockerRuntime` implementation relies on shelling out to the `docker` CLI. This approach is brittle and can lead to issues:

- Fragility: It depends on the `docker` binary being in the system's `PATH`.
- Parsing Instability: Methods like `Inspect` and `List` parse the text output of Docker commands. This output is not a stable API and can change between Docker versions, which would break `dmrlet`.
- Security: While there are no obvious command injections with the current usage, shelling out is generally less secure than using a proper API.

A more robust and maintainable solution would be to use the official Docker Go SDK (`github.com/docker/docker/client`). It provides a stable, typed API for interacting with the Docker daemon, eliminating the need for command execution and output parsing.
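A minimal sketch of what the SDK-based calls could look like, independent of the PR's `Runtime` interface. Note that the options type moved between SDK releases (older versions use `types.ContainerListOptions`), so the imports below assume a recent `github.com/docker/docker` release:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	// Connect via DOCKER_HOST / the default socket, negotiating the API version.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Typed equivalent of `docker ps -a`: no text parsing required.
	containers, err := cli.ContainerList(ctx, container.ListOptions{All: true})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		fmt.Println(c.ID[:12], c.Image, c.State)
	}

	// cli.ContainerInspect(ctx, id) would similarly replace `docker inspect <id>`.
}
```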
pkg/dmrlet/daemon/api.go
Outdated
    if resp.StatusCode != http.StatusOK {
        var errMsg string
        json.NewDecoder(resp.Body).Decode(&errMsg)
        return nil, fmt.Errorf("daemon error: %s", errMsg)
    }
When an error occurs on the server, http.Error is used, which writes a plain text response. However, the client attempts to decode the error response as JSON. This will fail and result in an unhelpful error message for the user. The client should read the response body as plain text to get the actual error message from the server.
    if resp.StatusCode != http.StatusOK {
        body, _ := io.ReadAll(resp.Body)
        return nil, fmt.Errorf("daemon error: %s", string(body))
    }
pkg/dmrlet/daemon/api.go
Outdated
    buf := make([]byte, 4096)
    for {
        n, err := resp.Body.Read(buf)
        if n > 0 {
            // Parse and send log lines
            // This is simplified - real implementation would properly parse
            ch <- logging.LogLine{
                Message: string(buf[:n]),
            }
        }
        if err != nil {
            return
        }
    }
The StreamLogs client implementation reads raw byte chunks from the HTTP response body. This can lead to garbled or incomplete log lines in the output, as a single log message might be split across multiple reads, or multiple small messages might be combined into one. It also doesn't handle client-side cancellation during streaming.
To ensure each log line is processed correctly, you should use a bufio.Scanner to read the stream line-by-line and check the context in the loop to make it more robust.
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        select {
        case <-ctx.Done():
            return
        case ch <- logging.LogLine{Message: scanner.Text() + "\n"}:
        }
    }
cmd/dmrlet/commands/logs.go
Outdated
    if line.Timestamp.IsZero() {
        fmt.Print(line.Message)
    } else {
        fmt.Printf("[%s] %s\n", line.Timestamp.Format("2006-01-02 15:04:05"), line.Message)
    }
pkg/dmrlet/autoscaler/metrics.go
Outdated
    // Parse Prometheus format metrics
    // This is simplified - real implementation would use prometheus client
    return endpointMetrics{}, nil
pkg/dmrlet/container/spec.go
Outdated
    modelFile := "/models"
    // For llama.cpp, we need to specify the .gguf file
    if b.config.Backend == BackendLlamaCpp {
        modelFile = "/models/model.gguf"
The model file path is hardcoded to /models/model.gguf for the llama.cpp backend. This assumes that the model file within the mounted directory is always named model.gguf. This might not always be the case, making the system brittle. It would be more robust to discover the actual model filename from the model store or make it configurable.
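A small, hedged sketch of discovering the `.gguf` file by globbing the mounted directory; the `findGGUF` helper and the assumption that a single GGUF file sits in the model directory are illustrative, not the Model Runner store's actual layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findGGUF returns the first *.gguf file found in modelDir, so the spec
// builder does not have to assume the file is named model.gguf.
func findGGUF(modelDir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(modelDir, "*.gguf"))
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", fmt.Errorf("no .gguf file found in %s", modelDir)
	}
	return matches[0], nil
}

func main() {
	dir := os.Getenv("MODEL_DIR") // hypothetical env var for the example
	path, err := findGGUF(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("model file:", path)
}
```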
pkg/dmrlet/daemon/daemon.go
Outdated
    // Create containers for each replica
    for i := 0; i < replicas; i++ {
        port := d.allocatePort()
        containerID := fmt.Sprintf("%s-%d", sanitizeID(config.Model), i)
The container ID is generated using a format string "%s-%d". When scaling up and down, if a container with a specific index is removed and then a new one is created, it might get the same index, leading to the same container ID. If the old container is not fully removed by the runtime yet, this can cause a name conflict.
Consider using a more robust method for generating unique container IDs, such as appending a short random string or a timestamp, to avoid potential race conditions.
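A hedged sketch of one such scheme, appending a short random hex suffix to the replica name; `containerID` is a hypothetical helper, not the PR's `sanitizeID`-based code:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// containerID builds a replica name that stays unique even if an
// identically indexed container has not been fully removed yet.
func containerID(model string, replica int) string {
	buf := make([]byte, 4)
	rand.Read(buf) // crypto/rand; error ignored for brevity in this sketch
	return fmt.Sprintf("%s-%d-%s", model, replica, hex.EncodeToString(buf))
}

func main() {
	fmt.Println(containerID("llama3", 0)) // e.g. llama3-0-a1b2c3d4
}
```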
dmrlet is a "Kubelet for AI" that runs inference containers directly with zero YAML overhead. It provides a simple CLI to serve models: dmrlet serve ai/smollm2 # Pulls model, starts inference container, exposes OpenAI API Key features: - Reuses existing pkg/distribution for model management - containerd integration for container lifecycle - GPU detection and passthrough (NVIDIA/AMD) - Auto port allocation (30000-30999 range) - Health checking with configurable timeout - Backend auto-detection (llama-server for GGUF, vLLM for safetensors) Commands: serve, stop, list, pull, version Signed-off-by: Eric Curtin <eric.curtin@docker.com>
dmrlet is a "Kubelet for AI" that runs inference containers directly
with zero YAML overhead. It provides a simple CLI to serve models:
dmrlet serve ai/smollm2
Pulls model, starts inference container, exposes OpenAI API
Key features:
Commands: serve, stop, list, pull, version